WL-TR-96-1065




MULTI-AGENT  RESIDUAL  ADVANTAGE 
LEARNING  WITH  GENERAL  FUNCTION 
APPROXIMATION 


Authors: 


Mance  E.  Harmon,  Lt 
Avionics  Control  Engineer 
Avionics  Directorate 
Wright  Laboratory 

Wright-Patterson  Air  Force  Base  OH  45433-7308 


Leemon  C.  Baird,  III,  Capt 
Computer  Science  Instructor 
Department  of  Computer  Science 
USAF  Academy 
USAFA  CO  80840-6234 


APRIL  1996 


FINAL  REPORT  APRIL  1996 


Approved  for  public  release;  distribution  unlimited 


19960604  075 


AVIONICS  DIRECTORATE 

WRIGHT  LABORATORY 

AIR  FORCE  MATERIEL  COMMAND 

WRIGHT-PATTERSON  AIR  FORCE  BASE,  OH  45433-7409 




NOTICE 


When  Government  drawings,  specifications,  or  other  data  are  used  for  any  purpose 
other  than  in  connection  with  a  definitely  Government-related  procurement,  the  United 
States  Government  incurs  no  responsibility  or  any  obligation  whatsoever.  The  fact  that  the 
government  may  have  formulated  or  in  any  way  supplied  the  said  drawings,  specifications, 
or  other  data,  is  not  to  be  regarded  by  implication,  or  otherwise  in  any  manner  construed, 
as  licensing  the  holder,  or  any  other  person  or  corporation;  or  as  conveying  any  rights  or 
permission  to  manufacture,  use,  or  sell  any  patented  invention  that  may  in  any  way  be 
related  thereto. 

This  report  is  releasable  to  the  National  Technical  Information  Service  (NTIS).  At 
NTIS,  it  will  be  available  to  the  general  public,  including  foreign  nations. 

This  technical  report  has  been  reviewed  and  is  approved  for  publication. 



Primary  Investigator 
WL/AACF 


Mr.  William  Baker,  Chief 
Fusion  Technology  Branch, 
Combat  Information  Division, 
Wright  Laboratory 


STEPHEN  G.  PETERS,  LtCol,  USAF 
Deputy  Chief 

Combat  Information  Technology  Division 


If  your  address  has  changed,  if  you  wish  to  be  removed  from  our  mailing  list,  or  if  the 
addressee  is  no  longer  employed  by  your  organization  please  notify  WL/AA,  WPAFB,  OH 
45433-7318  to  help  us  maintain  a  current  mailing  list. 

Copies  of  this  report  should  not  be  returned  unless  return  is  required  by  security 
considerations,  contractual  obligations,  or  notice  on  a  specific  document. 


REPORT  DOCUMENTATION  PAGE 

Form  Approved 

OMB  No.  0704-0188 

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this
collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

1.  AGENCY  USE  ONLY  (Leave  blank)  2.  REPORT  DATE  3.  REPORT  TYPE  AND  DATES  COVERED 

3  April  96  FINAL  4/3/96 

4.  TITLE  AND  SUBTITLE 

Multi-Agent Residual Advantage Learning with General

Function  Approximation 

5.  FUNDING  NUMBERS 

PE  61102F 

PR  2312 

TA  R1 

WU  02 

6.  AUTHOR(S) 

Mance  E.  Harmon,  WL/AACF 

Leemon C. Baird III, USAFA

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Avionics  Directorate 

Wright  Laboratory 

Air  Force  Materiel  Command 

Wright  Patterson  Air  Force  Base,  Ohio  45433-7409 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

WL-TR-96-1065 

9.  SPONSORING /MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

Avionics  Directorate 

Wright  Laboratory 

Air  Force  Materiel  Command 

Wright  Patterson  Air  Force  Base,  Ohio  45433-7409 

10.  SPONSORING  /  MONITORING 

AGENCY  REPORT  NUMBER 

WL-TR-96-1065 

11. SUPPLEMENTARY NOTES

12a. DISTRIBUTION / AVAILABILITY STATEMENT

Approved  for  Public  Release;  Distribution  is  Unlimited 

12b.  DISTRIBUTION  CODE 

13.  ABSTRACT  (Maximum  200  words) 

A new algorithm, advantage learning, is presented that improves on advantage updating by requiring
that a single function be learned rather than two. Furthermore, advantage learning requires only a
single type of update, the learning update, while advantage updating requires two different types of updates,
a  learning  update  and  a  normalization  update.  The  reinforcement  learning  system  uses  the  residual 
form  of  advantage  learning.  An  application  of  reinforcement  learning  to  a  Markov  game  is 
presented.  The  test-bed  has  continuous  states  and  nonlinear  dynamics.  The  advantage  function  is 
stored  in  a  single-hidden-layer  sigmoidal  network.  Speed  of  learning  is  increased  by  a  new 
algorithm, Incremental Delta-Delta (IDD), which extends Jacobs’ (1988) Delta-Delta for use in
incremental  training,  and  differs  from  Sutton’s  Incremental  Delta-Bar-Delta  (1992)  in  that  it  does 
not  require  the  use  of  a  trace  and  is  amenable  for  use  with  general  function  approximation  systems. 

To  our  knowledge,  this  is  the  first  time  an  approximate  second  order  method  has  been  used  with 
residual  algorithms.  Empirical  results  are  presented  comparing  convergence  rates  with  and  without 
the  use  of  IDD  for  the  reinforcement  learning  test-bed  and  for  a  supervised  learning  test-bed. 

14.  SUBJECT  TERMS 

15.  NUMBER  OF  PAGES 

20 

16.  PRICE  CODE 

17.  SECURITY  CLASSIFICATION  18.  SECURITY  CLASSIFICATION  19.  SECURITY  CLASSIFICATION 

OF  REPORT  OF  THIS  PAGE  OF  ABSTRACT 

UNCLASSIFIED  UNCLASSIFIED  UNCLASSIFIED 

20.  LIMITATION  OF  ABSTRACT 

SAR 

NSN  7540-01-280-5500  Standard  Form  298  (Rev.  2-89) 


Prescribed  by  ANSI  Std.  Z39-18 
298-102 


PREFACE 


Multi-Agent Residual Advantage Learning With General

Function  Approximation 


Mance  E.  Harmon 

Wright  Laboratory 
WL/AACF 
2241  Avionics  Circle 
Wright-Patterson  Air  Force  Base, 
OH  45433-7308 
harmonme@aa.wpafb.mil 


Leemon  C.  Baird  III 

U.S.A.F.  Academy 
2354  Fairchild  Dr.  Suite  6K41 
USAFA,  CO  80840-6234 
baird@cs.usafa.af.mil 


Abstract 

A  new  algorithm,  advantage  learning,  is  presented  that  improves  on  advantage 
updating  by  requiring  that  a  single  function  be  learned  rather  than  two.  Furthermore, 
advantage  learning  requires  only  a  single  type  of  update,  the  learning  update,  while 
advantage  updating  requires  two  different  types  of  updates,  a  learning  update  and  a 
normalization update. The reinforcement learning system uses the residual form of
advantage  learning.  An  application  of  reinforcement  learning  to  a  Markov  game  is 
presented.  The  test-bed  has  continuous  states  and  nonlinear  dynamics.  The  game 
consists  of  two  players,  a  missile  and  a  plane;  the  missile  pursues  the  plane  and  the 
plane  evades  the  missile.  On  each  time  step,  each  player  chooses  one  of  two  possible 
actions, turn left or turn right, resulting in a 90 degree instantaneous change in the
aircraft’s  heading.  Reinforcement  is  given  only  when  the  missile  hits  the  plane  or  the 
plane  reaches  an  escape  distance  from  the  missile.  The  advantage  function  is  stored 
in  a  single-hidden-layer  sigmoidal  network.  Speed  of  learning  is  increased  by  a  new 
algorithm,  Incremental  Delta-Delta  (IDD),  which  extends  Jacobs’  (1988)  Delta-Delta 
for  use  in  incremental  training,  and  differs  from  Sutton’s  Incremental  Delta-Bar-Delta 
(1992)  in  that  it  does  not  require  the  use  of  a  trace  and  is  amenable  for  use  with 
general  function  approximation  systems.  The  advantage  learning  algorithm  for 
optimal  control  is  modified  for  Markov  games  in  order  to  find  the  minimax  point, 
rather than the maximum. Empirical results gathered using the missile/aircraft test-bed
validate theory that suggests residual forms of reinforcement learning algorithms
converge  to  a  local  minimum  of  the  mean  squared  Bellman  residual  when  using 
general  function  approximation  systems.  Also,  to  our  knowledge,  this  is  the  first 
time  an  approximate  second  order  method  has  been  used  with  residual  algorithms. 
Empirical  results  are  presented  comparing  convergence  rates  with  and  without  the  use 
of  IDD  for  the  reinforcement  learning  test-bed  described  above  and  for  a  supervised 
learning  test-bed.  The  results  of  these  experiments  demonstrate  IDD  increased  the 
rate  of  convergence  and  resulted  in  an  order  of  magnitude  lower  total  asymptotic  error 
than  when  using  backpropagation  alone. 
1 INTRODUCTION

In Harmon, Baird, and Klopf (1995) it was demonstrated that the residual gradient form of the
advantage updating algorithm could learn the optimal policy for a linear-quadratic differential game
using a quadratic function approximation system. We propose a simpler algorithm, advantage
learning, which retains the properties of advantage updating but requires only one function to be
learned rather than two. A faster class of algorithms, residual algorithms, is proposed in Baird
(1995). We present empirical results demonstrating the residual form of advantage learning solving a
nonlinear game using a general neural network. The game is a Markov decision process (MDP)
with continuous states and nonlinear dynamics. The game consists of two players, a missile and a
plane; the missile pursues the plane and the plane evades the missile. On each time step each player
chooses one of two possible actions, turn left or turn right, which results in a 90 degree
instantaneous change in heading for the aircraft. Reinforcement is given only when the missile
either hits or misses the plane. The advantage function is stored in a single-hidden-layer sigmoidal
network. Rate of convergence is increased by a new algorithm we call Incremental Delta-Delta
(IDD), which extends Jacobs’ (1988) Delta-Delta for use in incremental training, as opposed to
epoch-wise training. IDD differs from Sutton’s Incremental Delta-Bar-Delta (1992) in that it does
not require the use of a trace (an average of recent values) and is useful for general function
approximation systems. The advantage learning algorithm for optimal control is modified for
Markov games in order to find the minimax point, rather than the maximum. Empirical results
gathered using the missile/aircraft test-bed validate theory that suggests residual forms of
reinforcement learning algorithms converge to a local minimum of the mean squared Bellman
residual when using general function approximation systems. Also, to our knowledge, this is the
first time an approximate second order method has been used with residual algorithms, and we
present empirical results comparing convergence rates with and without the use of IDD for the
reinforcement learning test-bed described above and for a supervised-learning test-bed.

In  Section  2  we  present  advantage  learning  and  describe  its  improvements  over  advantage 
updating.  In  Section  3  we  review  direct  algorithms,  residual  gradient  algorithms,  and  residual 
algorithms.  In  Section  4  we  present  a  brief  discussion  of  game  theory  and  review  research  in 
which  game  theory  has  been  applied  to  MDP-like  environments.  In  Section  5  we  present 
Incremental  Delta-Delta  (IDD),  an  incremental,  nonlinear  extension  to  Jacobs’  (1988)  Delta-Delta 
algorithm.  Also  presented  in  Section  5  are  empirical  results  generated  from  an  application  of  the 
IDD  algorithm  to  a  nonlinear  supervised  learning  task.  Section  6  explicitly  describes  the 
reinforcement  learning  testbed  and  presents  the  update  equations  for  residual  advantage  learning. 
Simulation  results  generated  using  the  missile/aircraft  test-bed  are  presented  and  discussed  in 
Section 7. These results include diagrams of learned behavior, a comparison of the system’s
ability to reduce the mean squared Bellman error for different values of $\phi$ (including an adaptive $\phi$),
and a comparison of the system’s performance with and without the use of IDD.

2  BACKGROUND 

2.1  Advantage  Updating 

The  advantage  updating  algorithm  (Baird,  1993)  is  a  reinforcement  learning  algorithm  in  which 
two  types  of  information  are  stored.  For  each  state  x,  the  value  V(x)  is  stored,  representing  an 
estimate  of  the  total  discounted  return  expected  when  starting  in  state  x  and  performing  optimal 
actions. For each state x and action u, the advantage, A(x,u), is stored, representing an estimate of
the degree to which the expected total discounted reinforcement is increased by performing action u
rather than the action currently considered best. The optimal value function V*(x) represents the
true value of each state. The optimal advantage function A*(x,u) will be zero if u is the optimal
action (because u confers no advantage relative to itself) and A*(x,u) will be negative for any
suboptimal u (because a suboptimal action has a negative advantage relative to the best action).
Advantage updating has been shown to learn faster than Q-learning (Watkins, 1989), especially for
continuous-time problems (Baird, 1993; Harmon, Baird, & Klopf, 1995).

2.2  Advantage  Learning 


Advantage  learning  improves  on  advantage  updating  by  requiring  only  a  single  function  to  be 
stored, the advantage function A(x,u). Furthermore, advantage updating requires two types of
updates  (learning  and  normalizing  updates),  while  advantage  learning  requires  only  a  single  type  of 
update  (the  learning  update).  For  each  state-action  pair  (x,u),  the  advantage  A(x,u)  is  stored, 
representing  the  utility  (advantage)  of  performing  action  u  rather  than  the  action  currently 
considered  best.  The  optimal  advantage  function  A*(x,u)  represents  the  true  advantage  of  each 
state-action  pair.  The  value  of  a  state  is  defined  as: 


$$V^*(x) = \max_{u} A^*(x,u) \tag{1}$$

The advantage A*(x,u) for state x and action u is defined to be:

$$A^*(x,u) = V^*(x) + \frac{\langle R + \gamma^{\Delta t} V^*(x') \rangle - V^*(x)}{\Delta t\,K} \tag{2}$$

where $\gamma^{\Delta t}$ is the discount factor per time step, $K$ is a time unit scaling factor, and <> represents the
expected value over all possible results of performing action u in state x to receive immediate
reinforcement R and to transition to a new state x’. Under this definition, an advantage can be
thought  of  as  the  sum  of  the  value  of  the  state  plus  the  expected  rate  at  which  performing  u 
increases  the  total  discounted  reinforcement.  For  optimal  actions  the  second  term  is  zero,  meaning 
the  value  of  the  action  is  also  the  value  of  the  state;  for  suboptimal  actions  the  second  term  is 
negative,  representing  the  degree  of  suboptimality  relative  to  the  optimal  action. 
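
As a small worked illustration of equations (1) and (2), the following Python sketch computes optimal advantages from a known optimal value function. The two-state deterministic problem, the function name `optimal_advantages`, and the chosen constants are assumptions made for this example and do not come from the report.

```python
def optimal_advantages(v_star, transitions, gamma_dt, dt, k):
    """Apply equation (2) to every state-action pair of a deterministic problem.

    v_star      -- dict mapping state -> optimal value V*(state)
    transitions -- dict mapping (state, action) -> (reward, next_state)
    gamma_dt    -- discount factor per time step
    dt, k       -- time step and time unit scaling factor
    """
    advantages = {}
    for (x, u), (reward, x_next) in transitions.items():
        # Expected change in value caused by taking u (deterministic here).
        expected_gain = reward + gamma_dt * v_star[x_next] - v_star[x]
        advantages[(x, u)] = v_star[x] + expected_gain / (dt * k)
    return advantages


# Hypothetical toy problem: from "start", "go" earns 1 and ends the episode,
# while "stay" earns nothing and remains in "start".
TRANSITIONS = {
    ("start", "go"): (1.0, "goal"),
    ("start", "stay"): (0.0, "start"),
    ("goal", "stay"): (0.0, "goal"),
}
V_STAR = {"start": 1.0, "goal": 0.0}

A_STAR = optimal_advantages(V_STAR, TRANSITIONS, gamma_dt=0.9, dt=0.1, k=1.0)
```

In this toy case the optimal action "go" has an advantage equal to the value of the state, and the suboptimal action "stay" has a smaller advantage, so taking the maximum over actions recovers V* as in equation (1).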


3  REINFORCEMENT  LEARNING  WITH  CONTINUOUS  STATES 


3.1  Direct  Algorithms 

For  predicting  the  outcome  of  a  Markov  chain  (a  degenerate  MDP  for  which  there  is  only  one 
possible  action),  an  obvious  algorithm  is  an  incremental  form  of  value  iteration,  which  is  defined 
as: 


$$V(x) \leftarrow (1-\alpha)\,V(x) + \alpha\,[R + \gamma V(x')] \tag{3}$$

If V(x) is represented by a function-approximation system other than a look-up table, update (3)
can be implemented directly by combining it with the backpropagation algorithm (Rumelhart,
Hinton, & Williams, 1986). For an input x, the actual output of the function-approximation system
would be V(x), the “desired output” used for training would be R+$\gamma$V(x’), and all of the weights
would be adjusted through gradient descent to make the actual output closer to the desired output.
Equation (3) is exactly the TD(0) algorithm, and could also be called the direct implementation of
incremental value iteration, Q-learning, and advantage learning.
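
The direct implementation can be sketched in Python as follows. A linear function approximator is assumed here only because its gradient is simply the feature vector; the function name `direct_update` and the feature encoding are hypothetical and not part of the report.

```python
def direct_update(w, feat_x, feat_next, reward, gamma, alpha):
    """One direct (TD(0)-style) update for a linear approximator V(x) = w . feat(x).

    The target R + gamma * V(x') is treated as a fixed desired output, so only
    the gradient of V(x) with respect to the weights is used.
    """
    v_x = sum(wi * fi for wi, fi in zip(w, feat_x))
    v_next = sum(wi * fi for wi, fi in zip(w, feat_next))
    error = reward + gamma * v_next - v_x
    # Gradient of V(x) with respect to w is feat_x for a linear approximator.
    return [wi + alpha * error * fi for wi, fi in zip(w, feat_x)]
```

With one-hot features the approximator is a lookup table and the update reduces to equation (3): starting from zero weights, `direct_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 1.0, 0.9, 0.5)` moves V(x) halfway toward the target R + $\gamma$V(x’) and leaves V(x’) unchanged.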

3.2  Residual  Gradient  Algorithms 

Reinforcement learning algorithms can be guaranteed to converge for lookup tables, yet be unstable
for function-approximation systems that have even a small amount of generalization when using
the direct implementation (Boyan, 1995). To find an algorithm that is more stable than the direct
algorithm, it is useful to specify the exact goal for the learning system. For the problem of
prediction on a deterministic Markov chain, the goal can be stated as finding a value function such
that, for any state x and its successor state x’, with a transition yielding immediate reinforcement
R, the value function will satisfy the Bellman equation:

$$V(x) = \langle R + \gamma V(x') \rangle \tag{4}$$

For  a  given  value  function  V,  and  a  given  state  x,  the  Bellman  residual  is  defined  to  be  the 
difference  between  the  two  sides  of  the  Bellman  equation.  The  mean  squared  Bellman  residual  for 
an  MDP  with  n  states  is  therefore  defined  to  be: 

$$E = \frac{1}{n} \sum_{x} \left[ \langle R + \gamma V(x') \rangle - V(x) \right]^2 \tag{5}$$

Residual  gradient  algorithms  change  the  weights  in  the  function-approximation  system  by 
performing  gradient  descent  on  the  mean  squared  Bellman  residual,  E.  This  is  called  the  residual 
gradient  algorithm. 
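
To make the residual gradient idea concrete, the sketch below evaluates the mean squared Bellman residual of equation (5) for a small deterministic chain and performs one gradient-descent step on it. The three-state chain, the lookup-table representation, and the function names are assumptions chosen for illustration only.

```python
def mean_squared_bellman_residual(values, chain, gamma):
    """Equation (5) for a deterministic chain given as state -> (reward, next_state)."""
    total = 0.0
    for x, (reward, x_next) in chain.items():
        residual = reward + gamma * values[x_next] - values[x]
        total += residual ** 2
    return total / len(chain)


def residual_gradient_step(values, chain, gamma, alpha):
    """One gradient-descent step on E, with a lookup table storing V."""
    n = len(chain)
    grad = {x: 0.0 for x in values}
    for x, (reward, x_next) in chain.items():
        residual = reward + gamma * values[x_next] - values[x]
        grad[x] += -2.0 * residual / n              # dE/dV(x)
        grad[x_next] += 2.0 * gamma * residual / n  # dE/dV(x'), ignored by direct methods
    return {x: values[x] - alpha * grad[x] for x in values}


# Hypothetical chain: 0 -> 1 -> 2, with state 2 absorbing and a reward of 1
# on the transition out of state 1.
CHAIN = {0: (0.0, 1), 1: (1.0, 2), 2: (0.0, 2)}
```

Unlike the direct update, each step here adjusts the successor's value as well as the state's own value, which is what makes the procedure true gradient descent on E.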

The  counterpart  of  the  Bellman  equation  for  advantage  learning  is: 

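$$A^*(x,u) = \left\langle R + \gamma^{\Delta t} \max_{u'} A^*(x',u') \right\rangle \frac{1}{\Delta t\,K} + \left(1 - \frac{1}{\Delta t\,K}\right) \max_{u'} A^*(x,u') \tag{6}$$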

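If A(x,u) is an approximation of A*(x,u), then the mean squared Bellman residual, E, is:

$$E = \left\langle \left[ \left\langle R + \gamma^{\Delta t} \max_{u'} A(x',u') \right\rangle \frac{1}{\Delta t\,K} + \left(1 - \frac{1}{\Delta t\,K}\right) \max_{u'} A(x,u') - A(x,u) \right]^2 \right\rangle \tag{7}$$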

where  the  inner  <>  is  the  expected  value  over  all  possible  results  of  performing  a  given  action  u  in 
a  given  state  x,  and  the  outer  <>  is  the  expected  value  over  all  possible  states  and  actions. 

3.3  Residual  Algorithms 

Direct  algorithms  can  be  fast  but  unstable,  and  residual  gradient  algorithms  may  be  stable  but  slow. 
Direct  algorithms  attempt  to  make  each  state  like  its  successor,  but  ignore  the  effects  of 
generalization  during  learning.  Residual  gradient  algorithms  take  into  account  the  effects  of 
generalization,  but  attempt  to  make  each  state  match  both  its  successor  and  its  predecessors.  A 
weighted average of a direct algorithm with a residual gradient algorithm could have guaranteed
convergence if the weighting factor $\phi$ were chosen correctly. Such an algorithm would cause the
mean  squared  Bellman  residual  to  decrease  monotonically,  but  would  not  necessarily  follow  the 
negative  gradient,  which  would  be  the  path  of  steepest  descent.  Therefore,  it  would  be  reasonable 
to  refer  to  such  algorithms  as  residual  algorithms  (Baird,  1995). 
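
The weighted average described above can be sketched for the value-prediction case as a single update with a mixing weight. The linear approximator, the function name `residual_update`, and the parameter name `phi` are assumptions of this sketch; a weight of 0 gives the direct update and a weight of 1 gives the residual gradient update.

```python
def residual_update(w, feat_x, feat_next, reward, gamma, alpha, phi):
    """Blend of direct and residual gradient updates for V(x) = w . feat(x).

    phi = 0 reproduces the direct update; phi = 1 reproduces gradient descent
    on the squared Bellman residual of the sampled transition.
    """
    v_x = sum(wi * fi for wi, fi in zip(w, feat_x))
    v_next = sum(wi * fi for wi, fi in zip(w, feat_next))
    residual = reward + gamma * v_next - v_x
    return [
        wi - alpha * residual * (phi * gamma * fn - fx)
        for wi, fx, fn in zip(w, feat_x, feat_next)
    ]
```

Because the update is linear in the mixing weight, any intermediate value yields exactly the corresponding weighted average of the two pure weight-change vectors.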

There is the question of how to choose $\phi$ appropriately. One approach is to treat it as a constant,
like the learning rate constant. Just as a learning rate constant can be chosen to be as high as
possible without causing the weights to blow up, so $\phi$ can be chosen as close to 0 as possible
without the weights blowing up. A $\phi$ of 1 is guaranteed to converge, and a $\phi$ of 0 might be
expected to learn quickly if it is stable at all. However, this may not be the best approach. It
requires an additional parameter to be chosen by trial and error, and it ignores the fact that the best
$\phi$ to use initially might not be the best $\phi$ to use later, after the system has learned for some time.
Fortunately, it is easy to calculate the $\phi$ that ensures a decreasing mean squared residual, while
bringing the weight change vector as close to the direct algorithm as possible (described in Section
6).
4 MARKOV GAMES


The theory of Markov decision processes (Barto et al., 1989; Howard, 1960) is the basis for most
of the recent reinforcement learning theory. However, this body of theory assumes that the agent’s
environment is stationary and, therefore, contains no other adaptive agents. The theory of games
(von Neumann and Morgenstern, 1947) is explicitly designed for reasoning about multi-agent
environments. The theory of Markov games (Van Der Wal, 1981) is an extension of game theory to
MDP-like environments, and was proposed by Littman (1994) as a framework for multi-agent
reinforcement learning.

Differential games (Isaacs, 1965) are Markov games played in continuous time, or with sufficiently
small time steps to approximate continuous time. Both players evaluate the given state and
simultaneously  execute  an  action,  with  no  knowledge  of  the  other  player's  selected  action.  The 
value  of  a  game  is  the  long-term,  discounted  reinforcement  if  both  opponents  play  the  game 
optimally  in  every  state.  Consider  a  game  in  which  player  A  tries  to  minimize  the  total  discounted 
reinforcement,  while  the  opponent,  player  B,  tries  to  maximize  the  total  discounted  reinforcement. 
Given the advantage $A(x, u_A, u_B)$ for each possible action in state x, it is useful to define the
minimax and maximin values for state x as:

$$\operatorname{minimax}(x) = \min_{u_A} \max_{u_B} A(x, u_A, u_B) \tag{8}$$

$$\operatorname{maximin}(x) = \max_{u_B} \min_{u_A} A(x, u_A, u_B) \tag{9}$$

If  the  minimax  equals  the  maximin,  then  the  minimax  is  called  a  saddlepoint  and  the  optimal  policy 
for  both  players  is  to  perform  the  actions  associated  with  the  saddlepoint.  If  a  saddlepoint  does  not 
exist,  then  the  optimal  policy  is  stochastic  if  an  optimal  policy  exists  at  all.  If  a  saddlepoint  does 
not  exist,  and  a  learning  system  treats  the  minimax  as  if  it  were  a  saddlepoint,  then  the  system  will 
behave  as  if  player  A  must  choose  an  action  on  each  time  step,  and  then  player  B  chooses  an  action 
based  upon  the  action  chosen  by  A.  For  the  algorithms  described  below,  a  saddlepoint  is  assumed 
to  exist.  If  a  saddlepoint  does  not  exist,  this  assumption  confers  a  slight  advantage  to  player  B. 
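
Equations (8) and (9) and the saddlepoint test can be illustrated with a small table of advantages for a single state. In this sketch the table layout (rows indexed by player A's action, columns by player B's action), the function names, and the example numbers are all assumptions made for illustration.

```python
def minimax_value(adv):
    """Equation (8): adv[a][b] holds A(x, u_A=a, u_B=b); A minimizes, B maximizes."""
    return min(max(row) for row in adv)


def maximin_value(adv):
    """Equation (9): B picks the column whose worst case (over A) is largest."""
    return max(min(column) for column in zip(*adv))


def has_saddlepoint(adv):
    """A saddlepoint exists when the minimax and maximin values coincide."""
    return minimax_value(adv) == maximin_value(adv)


# Hypothetical advantage tables for one state.
WITH_SADDLE = [[3.0, 5.0],
               [1.0, 2.0]]
MATCHING_PENNIES = [[1.0, -1.0],
                    [-1.0, 1.0]]
```

For `WITH_SADDLE` both values equal 2.0, so deterministic play is optimal. For `MATCHING_PENNIES` the minimax is 1.0 and the maximin is -1.0, so no saddlepoint exists; treating the minimax as the value then favors the maximizing player B, in line with the remark above.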

5  INCREMENTAL  DELTA-DELTA  (IDD)  FOR  NONLINEAR  FUNCTION 
APPROXIMATORS 

5.1  IDD  Derivation 

Incremental Delta-Bar-Delta (IDBD) was proposed by Sutton (1992) as an extension to the
Delta-Bar-Delta algorithm (Jacobs, 1988) that makes the algorithm amenable to incremental tasks
(learning tasks in which examples are processed one by one and then discarded). The IDBD
algorithm was described by Sutton as a meta-learning algorithm in the sense that it learns the
learning-rate parameters of an underlying base learning system. In Sutton (1992), the base
learning system was the Least-Mean-Square (LMS) rule, also known as the Widrow-Hoff rule
(Widrow and Stearns, 1985), and the IDBD algorithm was derived for linear function
approximation systems. Here, we present an extension to Jacobs’ (1988) Delta-Delta algorithm that
is appropriate for incremental training when using nonlinear function approximation systems.

As in the IDBD algorithm, in IDD each parameter of the neural network has an associated
learning rate of the form

$$\alpha_i(t) = e^{\beta_i(t)} \tag{10}$$

(where i indicates the parameter of association) that is updated after each step of learning. There
are two advantages resulting from the exponential relationship of the learning rate, $\alpha_i$, and the
memory parameter that is actually modified, $\beta_i$. First, this assures that $\alpha_i$ will always be positive.
Second, it allows geometric steps in $\alpha_i$. As $\beta_i$ is incremented or decremented by a fixed step-size,
$\alpha_i$ will move up or down by a fraction of its current value, allowing $\alpha_i$ to become very small.
IDD updates $\beta_i$ by

$$\beta_i(t+1) = \beta_i(t) + \theta\,\frac{\Delta w_i(t+1)}{\alpha_i(t)}\,\Delta w_i(t) \tag{11}$$

where $\Delta w_i$ is a change in the weight parameter $w_i$, and $\theta$ is the meta-learning rate. The derivation
of IDD is similar in principle to that of IDBD and is presented below. We start with


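$$\beta_i(t+1) = \beta_i(t) - \theta\,\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial \beta_i(t)} \tag{12}$$

where $\langle E^2 \rangle$ is the expected value of the mean squared error. Applying the chain rule we may
write

$$\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial \beta_i(t)} = \frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t+1)}\,\frac{\partial w_i(t+1)}{\partial \alpha_i(t)}\,\frac{\partial \alpha_i(t)}{\partial \beta_i(t)} \tag{13}$$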


By  evaluating  the  last  term  of  equation  (13),  shown  in  equation  (14),  and  then  substituting  the 
results  we  arrive  at  equation  (15). 


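$$\frac{\partial \alpha_i(t)}{\partial \beta_i(t)} = \frac{\partial e^{\beta_i(t)}}{\partial \beta_i(t)} = e^{\beta_i(t)} = \alpha_i(t) \tag{14}$$

$$\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial \beta_i(t)} = \frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t+1)}\,\frac{\partial w_i(t+1)}{\partial \alpha_i(t)}\,\alpha_i(t) \tag{15}$$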


By  evaluating  the  next  to  last  term  of  equation  (15)  and  rearranging,  we  find  the  equality  described 
by  equation  (16). 


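$$\frac{\partial w_i(t+1)}{\partial \alpha_i(t)} = \frac{\partial}{\partial \alpha_i(t)}\left[ w_i(t) - \alpha_i(t)\,\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t)} \right]$$

$$\frac{\partial w_i(t+1)}{\partial \alpha_i(t)} = -\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t)} \tag{16}$$

Again, substituting the results of equation (16) into (15) produces

$$\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial \beta_i(t)} = -\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t+1)}\,\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t)}\,\alpha_i(t) \tag{17}$$

Next, we define the change in the parameter $w_i$ and rearrange for substitution.

$$\Delta w_i(t) = -\alpha_i(t)\,\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t)}$$

$$\frac{\Delta w_i(t)}{\alpha_i(t)} = -\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial w_i(t)} \tag{18}$$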
By substituting the left-hand side of the second half of equation (18) into equation (17), and then
deriving the equivalent of equation (18) for $\Delta w_i(t+1)$ and substituting into equation (17), we
arrive at

$$\frac{\partial\,\tfrac{1}{2}\langle E^2(t+1)\rangle}{\partial \beta_i(t)} = -\frac{\Delta w_i(t+1)}{\alpha_i(t)}\,\Delta w_i(t) \tag{19}$$

Thus

$$\beta_i(t+1) = \beta_i(t) + \theta\,\frac{\Delta w_i(t+1)}{\alpha_i(t)}\,\Delta w_i(t) \tag{20}$$

The right-hand side of equation (19) provides a true unbiased estimate of the gradient of the error
surface with respect to the memory parameter, $\beta_i$. An equivalent of IDBD for nonlinear systems
can trivially be derived from IDD by replacing $\Delta w_i(t)$ with $\overline{\Delta w}_i(t)$, where $\overline{\Delta w}_i(t)$ is an
exponentially weighted sum of the current and past changes to $w_i$. The trace is defined by equation
(21).


$$\overline{\Delta w}_i(t) = (1-\varepsilon)\,\Delta w_i(t-1) + \varepsilon\,\overline{\Delta w}_i(t-1) \tag{21}$$


This form of IDBD for nonlinear systems includes another free parameter, $\varepsilon$, that determines the
decay rate of the trace. If it were possible to choose $\varepsilon$ perfectly for each training example, IDBD
would, in the worst case, be equivalent to IDD, and would on average provide a better estimate of
the gradient. However, the optimal value of $\varepsilon$ is a function of the rate of change in the gradient of
the error surface, and is therefore different for different regions of state space. Moreover, Jacobs’
original motivations for using delta-bar-delta rather than delta-delta are no longer relevant when
each learning rate is defined according to equation (10). For these reasons we used IDD to speed
the convergence rate for our testbed.
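
A minimal Python sketch of one IDD step for an arbitrary differentiable model is given below. It relies on equation (18), which says that the ratio of a weight change to its learning rate is the negative of the gradient, so the meta-update needs only the current gradient and the previous weight change. The ordering of the updates within a step, the function names, and the linear demonstration problem (target weights 2 and -1, initial learning rate 0.05) are assumptions of this sketch rather than details taken from the report.

```python
import math
import random


def idd_step(w, beta, prev_dw, grad, theta):
    """One IDD step; grad[i] is the derivative of the half squared error w.r.t. w[i]."""
    new_w, new_beta, new_dw = [], [], []
    for w_i, b_i, dw_prev, g_i in zip(w, beta, prev_dw, grad):
        # Meta-update in the spirit of equation (20): (-g_i) stands in for the
        # ratio of the newest weight change to its learning rate.
        b_i = b_i + theta * (-g_i) * dw_prev
        alpha_i = math.exp(b_i)   # equation (10): the rate is always positive
        dw_i = -alpha_i * g_i     # equation (18)
        new_w.append(w_i + dw_i)
        new_beta.append(b_i)
        new_dw.append(dw_i)
    return new_w, new_beta, new_dw


def train_linear_demo(steps=2000, theta=0.05, seed=0):
    """Hypothetical demo: fit y = 2*x0 - x1 incrementally, one example at a time."""
    rng = random.Random(seed)
    target = [2.0, -1.0]
    w = [0.0, 0.0]
    beta = [math.log(0.05)] * len(w)
    prev_dw = [0.0] * len(w)
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0) for _ in target]
        error = sum(wi * xi for wi, xi in zip(w, x)) - sum(ti * xi for ti, xi in zip(target, x))
        grad = [error * xi for xi in x]  # gradient of 0.5 * error**2
        w, beta, prev_dw = idd_step(w, beta, prev_dw, grad, theta)
    return w, [math.exp(b) for b in beta]
```

Because `idd_step` only consumes a gradient vector, the same function could be driven by gradients obtained through backpropagation in a nonlinear network; the linear model is used here only to keep the example short.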

5.2  IDD  Supervised  Learning  Results 


The capabilities of IDD were initially assessed using a supervised-learning task. The intent of this
experiment was to answer the question: Does the IDD algorithm perform better than the ordinary
backpropagation algorithm? The task involved six real-valued inputs (including a bias) and one
output. The inputs were chosen independently and randomly in the range [-1, 1]. The objective
function was the square of the first input summed with the second input. The function
approximator was a single-hidden-layer sigmoidal network with 5 hidden nodes. For each
algorithm, we trained the network for 50,000 iterations and then measured the asymptotic error.
This process was repeated 100 times using different initial random number seeds, and the results
were averaged. The experiment was repeated for different values of $\alpha$ in the range [0.35,
0.9]. The equivalent was done for the IDD algorithm. The experiment was repeated for
values of the meta-learning rate $\theta$ in the range [0.1, 1.0] in increments of 0.1. The results are
presented in Figure 1, and show that the IDD algorithm finds a distribution of learning rates
that is better than any single learning rate shared by all weights. The IDD algorithm consistently
reduced the error by an order of magnitude more than backprop alone.


Figure 1: Comparison of IDD and the backpropagation algorithms (horizontal axis: learning rate $\alpha$, from 0.35 to 0.9)
6 SIMULATION OF THE GAME


6.1  RESIDUAL  ADVANTAGE  LEARNING 


During  training,  a  state  is  chosen  from  a  uniform  random  distribution  on  each  learning  cycle.  The 
vector  of  weights  in  the  function  approximation  system,  W,  is  updated  according  to  equation  (22) 
on  each  learning  cycle. 


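$$\Delta W = -\alpha \left( \left(R + \gamma^{\Delta t} A_{\min\max}(x',u)\right)\frac{1}{\Delta t\,K} + \left(1 - \frac{1}{\Delta t\,K}\right) A_{\min\max}(x,u) - A(x,u) \right) \left( \phi\,\gamma^{\Delta t}\,\frac{\partial}{\partial W} A_{\min\max}(x',u)\,\frac{1}{\Delta t\,K} + \phi \left(1 - \frac{1}{\Delta t\,K}\right) \frac{\partial}{\partial W} A_{\min\max}(x,u) - \frac{\partial}{\partial W} A(x,u) \right) \tag{22}$$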
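
The weight update of equation (22) can be sketched for a single sampled transition as below. The sketch reads the update as the sampled Bellman residual of advantage learning multiplied by a blend of gradients, with a mixing weight of 0 giving the direct update and 1 giving the residual gradient update. The function name, the argument names, and the use of plain Python lists for gradients are assumptions made for illustration.

```python
def residual_advantage_delta(alpha, phi, gamma_dt, dt_k, reward,
                             a_next, a_here, a_taken,
                             grad_next, grad_here, grad_taken):
    """Weight change for one transition under residual advantage learning.

    a_next  -- minimax advantage at the successor state x'
    a_here  -- minimax advantage at the current state x
    a_taken -- advantage A(x, u) of the action pair actually evaluated
    grad_*  -- gradients of those three quantities w.r.t. the weights
    dt_k    -- the product of the time step and the scaling factor K
    """
    residual = ((reward + gamma_dt * a_next) / dt_k
                + (1.0 - 1.0 / dt_k) * a_here
                - a_taken)
    return [
        -alpha * residual * (phi * gamma_dt * g_next / dt_k
                             + phi * (1.0 - 1.0 / dt_k) * g_here
                             - g_taken)
        for g_next, g_here, g_taken in zip(grad_next, grad_here, grad_taken)
    ]
```

When the mixing weight is 0, only the gradient of A(x,u) contributes, so the step simply moves A(x,u) toward its target; with a weight of 1 the successor and minimax terms are also adjusted.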


The parameter $\phi$ is a constant that controls a trade-off between pure gradient descent (when $\phi$
equals 1) and a fast direct algorithm (when $\phi$ equals 0). A $\phi$ that ensures a decreasing mean
squared residual, while bringing the weight change vector as close to the direct algorithm as
possible, can be calculated by maintaining an estimate of the epoch-wise weight change vectors.
These can be approximated by maintaining two scalar values, $w_d$ and $w_{rg}$, associated with each
weight w in the function approximation system. These are traces, averages of recent values, used
to approximate $\Delta W_d$ and $\Delta W_{rg}$. The traces are updated on each learning cycle according to:


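$$w_d \leftarrow (1-\mu)\,w_d - \mu \left[ \left(R + \gamma^{\Delta t} A_{\min\max}(x',u)\right)\frac{1}{\Delta t\,K} + \left(1 - \frac{1}{\Delta t\,K}\right) A_{\min\max}(x,u) - A(x,u) \right] \cdot \left[ -\frac{\partial}{\partial w} A(x,u) \right] \tag{23}$$

$$w_{rg} \leftarrow (1-\mu)\,w_{rg} - \mu \left[ \left(R + \gamma^{\Delta t} A_{\min\max}(x',u)\right)\frac{1}{\Delta t\,K} + \left(1 - \frac{1}{\Delta t\,K}\right) A_{\min\max}(x,u) - A(x,u) \right] \cdot \left[ \gamma^{\Delta t}\,\frac{\partial}{\partial w} A_{\min\max}(x',u)\,\frac{1}{\Delta t\,K} + \left(1 - \frac{1}{\Delta t\,K}\right) \frac{\partial}{\partial w} A_{\min\max}(x,u) - \frac{\partial}{\partial w} A(x,u) \right] \tag{24}$$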


where $\mu$ is a small, positive constant that governs how fast the system forgets. On each time step a
stable $\phi$ is calculated by using equation (25). This ensures convergence while maintaining fast
learning:


$$
\phi = \frac{\sum_w w_d\, w_{rg}}{\sum_w (w_d - w_{rg})\, w_{rg}} + \mu \tag{25}
$$


It is important to note that this algorithm does not follow the negative gradient, which would be the
steepest path of descent. However, the algorithm does cause the mean squared residual to decrease
monotonically (for appropriate $\phi$), thereby guaranteeing convergence to a local minimum of the
mean squared Bellman residual.
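The trace updates of equations (23) and (24) and the calculation of the adaptive weighting factor in equation (25) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. The names `update_traces` and `adaptive_phi` are hypothetical, and the fallback to a weighting factor of 0 when the denominator vanishes or the result leaves [0, 1] is an assumption of this sketch that is not stated in this section.

```python
from typing import List, Sequence


def update_traces(
    w_d: List[float],
    w_rg: List[float],
    residual: float,                # first bracket of equations (23)/(24)
    grad_direct: Sequence[float],   # per weight: -dA(x,u)/dw
    grad_residual: Sequence[float], # per weight: second bracket of equation (24)
    mu: float,
) -> None:
    """Update both traces in place for one learning cycle."""
    for i in range(len(w_d)):
        w_d[i] = (1.0 - mu) * w_d[i] - mu * residual * grad_direct[i]
        w_rg[i] = (1.0 - mu) * w_rg[i] - mu * residual * grad_residual[i]


def adaptive_phi(w_d: Sequence[float], w_rg: Sequence[float], mu: float) -> float:
    """Weighting factor from equation (25), summed over all weights."""
    numerator = sum(d * r for d, r in zip(w_d, w_rg))
    denominator = sum((d - r) * r for d, r in zip(w_d, w_rg))
    if denominator == 0.0:
        return 0.0  # assumption: fall back to the direct method
    phi = numerator / denominator + mu
    # Assumption: values outside [0, 1] also fall back to the direct method.
    return phi if 0.0 <= phi <= 1.0 else 0.0
```

For example, when the two trace vectors point in exactly opposite directions with equal length, the ratio in equation (25) is 0.5, and the added $\mu$ moves the weighting slightly toward the residual gradient direction.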
6.2 GAME DEFINITION


Our  system  is  a  differential  game  with  a  missile  pursuing  a  plane,  as  in  Rajan,  Prasad,  and  Rao 
(1980)  and  Millington  (1991).  The  action  performed  by  the  missile  is  a  function  of  the  state.  The 
action  performed  by  the  plane  is  a  function  of  the  state  and  the  action  of  the  missile.  The  use  of  the 
minimax  for  determination  of  policy  guarantees  a  solution  to  the  game  by  ensuring  a  deterministic 
system. 
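The ordering of decisions described above (the missile commits to an action given the state, and the plane responds knowing the missile's action) can be illustrated with a short Python sketch. It assumes the two action values 0.5 and -0.5 used in this game and an arbitrary callable `advantage(state, u_m, u_p)` standing in for the learned advantage function; the name `minimax_actions` is hypothetical.

```python
from typing import Callable, Sequence, Tuple

ACTIONS: Tuple[float, float] = (0.5, -0.5)  # left turn, right turn


def minimax_actions(
    advantage: Callable[[object, float, float], float],
    state: object,
    actions: Sequence[float] = ACTIONS,
) -> Tuple[float, float]:
    """Return (missile_action, plane_action) under the minimax policy."""

    def plane_reply(u_m: float) -> float:
        # The plane maximizes, knowing the missile's action.
        return max(actions, key=lambda u_p: advantage(state, u_m, u_p))

    # The missile minimizes the value of the plane's best reply.
    u_m = min(actions, key=lambda u: advantage(state, u, plane_reply(u)))
    return u_m, plane_reply(u_m)
```

Because each player's choice is a deterministic function of its inputs, the resulting joint action is unique for a given state (up to ties, which this sketch breaks by the order of `actions`).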

The game is Markov with continuous states and nonlinear dynamics. The state $x$ is a vector
$(x_m, x_p)$ composed of two elements: the state of the missile, $x_m$, and the state of the plane, $x_p$,
each of which is a vector of four elements containing the scalar values of the x and y
coordinates and the x and y velocities of the player in two-dimensional space. The action $u$ is a
vector $(u_m, u_p)$ composed of two scalar elements. The element $u_m$ represents the action performed
by the missile, and the element $u_p$ represents the action performed by the plane; an action value of
0.5 indicates an instantaneous 90-degree change of heading to the left of the current heading, and
an action value of -0.5 indicates an instantaneous 90-degree change of heading to the right of the
current heading. The next state $x_{t+1}$ is a nonlinear function of the current state $x_t$ and action $u_t$. The
speed of each player is fixed, with the speed of the missile twice that of the plane. Therefore, the
Euclidean distance covered in one state transition is a fixed scalar value for each player, with the
value for the missile being twice that for the plane. On each time step the heading of each player
is updated according to the action chosen, the velocity in both the x and y dimensions is computed
for each player, and the positions of the players are updated.
Pseudocode describing these dynamics follows:

deg2rad - converts degrees to radians; // trigonometric functions measured in radians
Plane_action - action chosen by plane;
Plane_theta - current heading of plane in 2D space;
Missile_velocity_X - x component of the missile velocity vector;
Missile_velocity_Y - y component of the missile velocity vector;

1) If (Plane_action = 0.5) then Plane_theta = Plane_theta + 90 * deg2rad;
   else Plane_theta = Plane_theta - 90 * deg2rad;
2) Repeat step 1 for Missile.
3) Normalize Plane_theta and Missile_theta.
4) Missile_velocity_X = Missile_speed * cos(Missile_theta);
5) Missile_velocity_Y = Missile_speed * sin(Missile_theta);
6) Repeat steps 4 & 5 for Plane;
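The steps above can be sketched as runnable Python. This is one possible reading, under stated assumptions: the `Player` class and `step` function are hypothetical names, headings are normalized to the interval (-pi, pi] (the pseudocode does not specify a range), and the position update, which the text mentions but the pseudocode omits, is taken to be one velocity vector per state transition.

```python
import math
from dataclasses import dataclass


@dataclass
class Player:
    x: float
    y: float
    theta: float      # heading in radians
    speed: float      # fixed distance covered per state transition
    vx: float = 0.0
    vy: float = 0.0


def step(player: Player, action: float) -> None:
    """Apply one state transition to a player in place."""
    turn = math.radians(90.0)
    # Steps 1-2: 0.5 turns left, any other value (-0.5) turns right.
    player.theta += turn if action == 0.5 else -turn
    # Step 3: normalize the heading (range is an assumption of this sketch).
    player.theta = math.atan2(math.sin(player.theta), math.cos(player.theta))
    # Steps 4-6: velocity components from the fixed speed and new heading.
    player.vx = player.speed * math.cos(player.theta)
    player.vy = player.speed * math.sin(player.theta)
    # Position update (assumed: one velocity vector per transition).
    player.x += player.vx
    player.y += player.vy
```

A plane could be created as `Player(0.0, 0.0, 0.0, speed=1.0)` and a missile with `speed=2.0`, reflecting the stated two-to-one speed ratio; the absolute speeds here are illustrative only.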

The  reinforcement  function  R  is  a  function  of  the  distance  between  the  players.  A  reinforcement  of 
1  is  given  when  the  Euclidean  distance  between  the  players  is  greater  than  2  units  (plane  escapes). 
A  reinforcement  of  -1  is  given  when  the  distance  is  less  than  0.25  units  (missile  hits  plane).  No 
reinforcement  is  given  when  the  distance  is  in  the  range  [0.25,2].  The  missile  seeks  to  minimize 
reinforcement,  while  the  plane  seeks  to  maximize  reinforcement. 
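The reinforcement function described above is simple enough to state directly in code. The following Python sketch uses the thresholds given in the text; the function name `reinforcement` and the representation of positions as (x, y) pairs are assumptions of the sketch.

```python
import math
from typing import Sequence


def reinforcement(missile_xy: Sequence[float], plane_xy: Sequence[float]) -> float:
    """Reinforcement as a function of the distance between the players."""
    distance = math.dist(missile_xy, plane_xy)
    if distance > 2.0:
        return 1.0    # plane escapes
    if distance < 0.25:
        return -1.0   # missile hits plane
    return 0.0        # distance in [0.25, 2]: no reinforcement
```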

The advantage function is approximated by a single-hidden-layer neural network with 50 hidden
nodes. The hidden-layer nodes each have a sigmoidal activation function, the output of which lies
in the range [-1,1]. The output of the network is a linear combination of the outputs of the
hidden-layer nodes with their associated weights. To speed the rate of convergence we used IDD as
described in Section 5. There are 6 inputs to the network. The first 4 inputs describe the state and
are normalized to the range [-1,1]. They consist of the differences in positions and velocities of the
players in both the x and y dimensions. The remaining inputs describe the action to be taken by
each player; 0.5 and -0.5 indicate left and right turns, respectively.
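A forward pass through a network of this shape can be sketched in plain Python. Several details here are assumptions rather than facts from the report: `tanh` is used as the sigmoidal activation with range (-1, 1), bias terms are included on both layers, weights are initialized uniformly in a small range, and the missile action is placed before the plane action in the input vector. The names `init_network` and `advantage` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

N_INPUTS, N_HIDDEN = 6, 50
Network = Tuple[List[List[float]], List[float]]


def init_network(seed: int = 0, scale: float = 0.1) -> Network:
    """Random small weights; the extra column/entry is an assumed bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-scale, scale) for _ in range(N_INPUTS + 1)]
              for _ in range(N_HIDDEN)]
    output = [rng.uniform(-scale, scale) for _ in range(N_HIDDEN + 1)]
    return hidden, output


def advantage(net: Network, rel_state: Sequence[float],
              u_m: float, u_p: float) -> float:
    """Approximate A(x, u) for a 4-element relative state and two actions."""
    hidden, output = net
    inputs = list(rel_state) + [u_m, u_p, 1.0]  # trailing 1.0 feeds the bias
    # Hidden layer: sigmoidal units with outputs in (-1, 1).
    activations = [math.tanh(sum(w * v for w, v in zip(row, inputs)))
                   for row in hidden]
    activations.append(1.0)                     # bias input to the output unit
    # Output: linear combination of the hidden-layer outputs.
    return sum(w * a for w, a in zip(output, activations))
```

Here `rel_state` holds the four normalized differences in position and velocity, so a call looks like `advantage(net, [dx, dy, dvx, dvy], 0.5, -0.5)`.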
7 RESULTS


Experiments  were  formulated  to  accomplish  three  objectives.  The  first  objective  was  to  determine 
heuristically  to  what  degree  residual  advantage  learning  could  learn  a  reasonable  policy  for  the 
missile/aircraft  system.  In  Harmon,  Baird,  and  Klopf  (1995)  it  was  possible  to  calculate  the 
optimal  weights  for  the  quadratic  function  approximator  used  to  represent  the  advantage  function. 
This is not the case for the current system. The nonlinear dynamics of this system require more
representational  capacity  than  a  simple  quadratic  network  to  store  the  advantage  function.  In  using 
a  single  hidden-layer  sigmoidal  network  we  gain  the  representational  capacity  needed  but  lose  the 
ability  to  calculate  the  optimal  parameters  for  the  network,  which  would  have  been  useful  as  a 
metric.  For  this  reason,  our  metric  is  reduced  to  simple  observation  of  the  system  behavior,  and  is 
analogous  to  the  metric  used  to  evaluate  Tesauro’s  TD-Gammon  (Tesauro,  1990).  Also,  it  is 
possible  this  game  might  be  made  less  difficult  to  solve  if  expressed  in  an  appropriate  coordinate 
system, such as plane- and missile-centered polar coordinates. However, the motivation for this
experiment  is  to  demonstrate  the  ability  of  residual  algorithms  to  solve  difficult,  nonlinear  control 
problems  using  a  general  neural  network.  For  this  reason,  the  game  was  explicitly  structured  to  be 
difficult  to  solve. 

The second objective was to analyze the performance of three different forms of advantage
learning: the residual gradient form, the direct form, and a weighted average of the two (values of
$\phi$ in the range [0, 1]). The third and final objective was to evaluate the utility of IDD for this
test-bed and to address the following question: when using residual algorithms, which method
increases the rate of convergence the most: 1) using the residual gradient form of the algorithm
with a second-order method, 2) using the residual form of the algorithm with an adaptive $\phi$, or
3) some combination of the two?

Addressing  the  first  objective,  the  reinforcement  learning  system  implementing  the  residual  form  of 
advantage  learning  produced  a  reasonable  policy  after  800,000  training  cycles.  The  missile  learned 
to  pursue  the  plane,  and  the  plane  learned  to  evade  the  missile.  Interesting  behavior  was  exhibited 
by both players under certain initial conditions. First, the plane learned that in some cases it is able
to evade the missile indefinitely by continuously flying in circles within the missile’s turn radius.
Second,  the  missile  learned  to  anticipate  the  position  of  the  plane.  Rather  than  heading  directly 
toward  the  plane,  the  missile  learned  to  lead  the  plane  under  appropriate  circumstances. 


Figure  2  (a)  Demonstration  of  missile  leading  plane  after  learning  and  (b)  ultimately  hitting  the  plane. 
Figure 2 (a) Demonstration of the ability of the plane to survive indefinitely by flying in continuous circles within the
missile’s  turn  radius,  (b)  Demonstration  of  the  learned  behavior  of  the  plane  to  turn  toward  the  missile  to  increase  the 
distance  between  the  two  in  the  long  term. 

In Experiment 2, the effects on the learning system’s convergence rate of different values of $\phi$,
the weighting factor used in the linear combination of the residual gradient update vector and the
direct method update vector, and of the use of IDD were compared. For $\Delta t$ values of 1.0 and 0.1,
twelve different runs were performed, each using identical parameters with the exception of the
weighting factor $\phi$ and the use or non-use of IDD. Figure 3 presents the results of these
experiments. A $\phi$ of 1 yields advantage learning in the residual gradient form, while a $\phi$ of 0 yields
advantage learning implemented in the direct form.


Figure 3: $\phi$ comparison

A $\phi$ of 1, which yields the residual gradient algorithm, minimized the mean squared Bellman
residual more than the other values of $\phi$ tested, including the adaptive $\phi$. The use of IDD did
result in a lower total error in all comparisons. Furthermore, the use of IDD was imperative for
finding a control policy that generated reasonable trajectories for the aircraft. The use of IDD not
only increased the rate of convergence, but also resulted in a smaller asymptotic error level. When
training with a value of 0 for the parameter $\phi$, reducing the algorithm to the direct method of
advantage learning, the weights grew without bound when using a time duration, $\Delta t$, of 0.1.

Residual algorithms increase the speed of convergence by following the gradient of the direct
method as closely as possible, while still guaranteeing convergence to a local minimum by ensuring
the weight change vectors monotonically reduce the mean squared Bellman residual error. By
combining residual algorithms with IDD, we have two separate mechanisms pursuing the same
goal: a fast rate of convergence. How these mechanisms interact, complement, or inhibit one
another is not fully understood. It was necessary to put a ceiling of -4 on the growth of $\beta$ to
ensure system stability. One might think that by simply decreasing the meta-learning rate $\theta$, one
would not need to add a heuristic ceiling to the parameter $\beta$. This turned out not to be the case.
Even for a very small $\theta$, there did exist $\beta$ that grew to the point of causing the system to be
unstable (the weights grew to infinity).
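The ceiling on $\beta$ described above amounts to a clamp applied after each meta-learning step. The Python sketch below illustrates only that clamp. It assumes, as in Sutton's IDBD algorithm, that each weight's learning rate is the exponential of its $\beta$; the `meta_gradient` argument is a placeholder for whatever increment the IDD rule of Section 5 prescribes, and the names `update_beta` and `learning_rate` are hypothetical.

```python
import math

BETA_CEILING = -4.0  # ceiling on beta reported in the text


def update_beta(beta: float, theta: float, meta_gradient: float,
                ceiling: float = BETA_CEILING) -> float:
    """One meta-learning step on beta, clamped at the ceiling.

    meta_gradient is a placeholder for the IDD increment (see Section 5).
    """
    return min(beta + theta * meta_gradient, ceiling)


def learning_rate(beta: float) -> float:
    """Per-weight learning rate, assuming alpha = exp(beta) as in IDBD."""
    return math.exp(beta)
```

Under that assumption, a ceiling of -4 on $\beta$ bounds every per-weight learning rate by exp(-4), roughly 0.018, regardless of how small the meta-learning rate $\theta$ is.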


Further research is needed to determine if, in general, using second-order methods with residual
gradient algorithms is preferable to using residual algorithms with an adaptive $\phi$. Although this
was the case for the experiments described above, the following results were also generated on
the missile/aircraft test-bed using slightly different parameter settings (e.g., $K$, $\Delta t$, $\mu$). These
results reflect the use of IDD.


Figure 4: The use of an adaptive $\phi$ reduced the mean squared Bellman residual further than any
of the tested fixed $\phi$'s. [Plot legend: fixed $\phi$; adaptive $\phi$]

For this set of experiments the Bellman error was minimized the most by combining the use of an
adaptive $\phi$ with the use of IDD. The resultant control policy from these experiments also produced
aircraft trajectories that looked reasonable. Using IDD resulted in a lower mean squared Bellman
error for all values of $\phi$, including the adaptive $\phi$. Why this is the case and how these
mechanisms interact will be explored in future research.

8  CONCLUSIONS 

The  results  gathered  using  the  missile/aircraft  test-bed  provide  evidence  that  residual  forms  of 
reinforcement  learning  algorithms  produce  reinforcement  learning  systems  that  are  stable  and 
converge  to  a  local  minimum  of  the  mean  squared  Bellman  residual  when  using  general  function 
approximation  systems.  In  general,  non-linear  problems  of  this  type  are  difficult  to  solve  with 
classical  game  theory  and  control  theory,  and  therefore  appear  to  be  good  applications  for 
reinforcement  learning. 

The data also suggest that the use of second-order methods may be desirable or even necessary,
as  was  the  case  for  this  test-bed,  to  generate  the  desired  control  policy.  Although  much  research 
has  been  accomplished  investigating  approaches  for  speeding  rates  of  convergence,  the  results 
gathered  from  applying  these  methods  in  supervised  learning  tasks  may  not  necessarily  hold  true 
for  reinforcement  learning  tasks.  This  stems  from  fundamental  differences  in  the  nature  of 
supervised  and  reinforcement  learning.  For  this  reason,  we  feel  that  a  rigorous  comparison  of 
these  methods  implemented  in  a  reinforcement  learning  system  is  an  appropriate  topic  for  future 
research. 